runZinb <- TRUE
runClus <- TRUE
NCORES <- 7

FP: The version of zinbwave to use is the latest version of the normvalues branch (branched from master). Parallel computing is now handled by BiocParallel. If you have a Windows machine, please update the code to allow parallel computing.

mysystem = Sys.info()[['sysname']]
switch(mysystem,
       Windows = {print("I'm a Windows PC and you should choose the parallel computing you want.")},
       Linux = {print("I'm a penguin and I'll use package MulticoreParam for parallel computing.")},
       Darwin = {print("I'm a Mac and I'll use package doParallel for parallel computing.")})
## [1] "I'm a Mac and I'll use package doParallel for parallel computing."
if (mysystem == 'Darwin'){
  registerDoParallel(NCORES)
  register(DoparParam())
}else if (mysystem == 'Linux'){
  register(bpstart(MulticoreParam(workers=NCORES)))
}else{
  print('Please change this to allow parallel computing')
  register(SerialParam())
}
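
The branching above could also be collected in one helper that covers Windows. The following is only a sketch: `pick_backend` is a hypothetical name, and it returns the name of the BiocParallel parameter class one might register rather than registering anything. `SnowParam` is BiocParallel's socket-based backend, which does not rely on forking and is therefore an option on Windows.

```r
# Hypothetical helper: map an OS name to a parallel backend description.
# Nothing is registered here; the caller would construct the BiocParallel
# object itself (e.g. MulticoreParam(workers = n) on Linux).
pick_backend <- function(sysname, ncores) {
  stopifnot(is.character(sysname), length(sysname) == 1, ncores >= 1)
  type <- switch(sysname,
                 Linux   = "MulticoreParam",  # forking is available
                 Darwin  = "DoparParam",      # foreach backend, as in the chunk above
                 Windows = "SnowParam",       # no forking on Windows: use sockets
                 "SerialParam")               # unknown system: fall back to serial
  workers <- if (type == "SerialParam") 1L else as.integer(ncores)
  list(type = type, workers = workers)
}

backend <- pick_backend(Sys.info()[["sysname"]], ncores = 7)
print(backend)
```

Keeping the decision in one function makes it easy to test the Windows branch on a machine that is not running Windows.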

EAP: small request. Can everyone put a blank line between the beginning of an R chunk and the text? It makes it nicely formatted in my text editor.

Steps of the workflow

We propose a workflow to analyze single-cell RNA-Seq data with the following steps:

  • Dimensionality reduction using zinbwave to get W which should capture the biology,
  • Cluster cells using clusterExperiment on W to get the cluster labels,
  • Get lineage using slingshot on W and cluster labels from clusterExperiment,
  • Get DE genes between lineages/clusters.

Throughout the workflow, we use deviance residuals as adjusted values.

knitr::include_graphics('../vignettes/workflow.png')
Workflow to analyze single-cell RNA-Seq data

1. Create a SummarizedExperiment object

Throughout the workflow, we want to use a single SummarizedExperiment object carrying all the data we need.

EAP: I have updated the code to pull from a dataset in the repo that is created with the createData.R file. For now, I am filtering to the top 1000 most variable genes there, though we might want to add that to the code for the article. This will be slightly different data from Russell’s, which didn’t use all of the samples. We can adjust that decision later, or just compare the samples that are the same. (Russell’s clusterLabels are in the metadata.)

EAP: zinbFit doesn’t accept data.frame objects, so we currently need a data.matrix call. Should it be changed so that it does?

#counts<-read.table("../data/oeCufflinkCountData.txt",sep="\t",header=TRUE)
core <- read.table("../data/oeCufflinkCountData_1000Var.txt",
                   sep = "\t", header = TRUE)
core <- data.matrix(core)
metadata <- read.table("../data/oeMetadata.txt", sep = "\t", header = TRUE)
# symbol for samples missing from original clustering
metadata$clusterLabels[is.na(metadata$clusterLabels)] <- -2

Here we only look at the 1000 most variable genes. EAP: see the note above; I’ve commented out the filtering and moved it to createData.R for now.
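
For readers without access to createData.R, the filtering step might look like the sketch below. This is an assumption, not the code from that file: the helper name `top_variable_genes` is hypothetical, and ranking genes by the variance of log1p-transformed counts is one common choice among several.

```r
# Hypothetical helper: keep the n genes (rows) with the largest variance
# of log1p-transformed counts. The variance measure is an assumption.
top_variable_genes <- function(counts, n = 1000) {
  stopifnot(is.matrix(counts), n >= 1)
  v <- apply(log1p(counts), 1, var)
  keep <- order(v, decreasing = TRUE)[seq_len(min(n, nrow(counts)))]
  counts[keep, , drop = FALSE]
}

# Toy count matrix: 6 genes x 4 cells
set.seed(1)
toy <- matrix(rpois(24, lambda = 20), nrow = 6,
              dimnames = list(paste0("gene", 1:6), paste0("cell", 1:4)))
toy["gene3", ] <- c(0, 500, 0, 800)   # a clearly variable gene
filtered <- top_variable_genes(toy, n = 2)
print(rownames(filtered))
```

The rows of the result are ordered from most to least variable, so the first row is the most variable gene.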

batch <- metadata$Batch

Cells were processed in 18 different batches.

col_batch = rep(brewer.pal(9, "Set1"), 2)
names(col_batch) = unique(batch)
table(batch)
## batch
## GBC08A GBC08B GBC09A GBC09B    P01    P02   P03A   P03B    P04    P05 
##     41     47     43     38     34     49     73     52     24     23 
##    P06    P10    P11    P12    P13    P14    Y01    Y04 
##     53     51     54     51     60     49     65     42

We have quality control (QC) measures for each cell.

qc <- metadata[, !names(metadata) %in% c("Batch", "Experiment", "clusterLabels")]
head(qc, 2)
##                  NREADS NALIGNED  RALIGN TOTAL_DUP    PRIMER
## OEP01_N706_S501 3313260  3167600 95.6035   47.9943 0.0154566
## OEP01_N701_S501 2902430  2757790 95.0167   45.0150 0.0182066
##                 PCT_RIBOSOMAL_BASES PCT_CODING_BASES PCT_UTR_BASES
## OEP01_N706_S501               2e-06         0.200130      0.230654
## OEP01_N701_S501               0e+00         0.182461      0.201810
##                 PCT_INTRONIC_BASES PCT_INTERGENIC_BASES PCT_MRNA_BASES
## OEP01_N706_S501           0.404205             0.165009       0.430784
## OEP01_N701_S501           0.465702             0.150027       0.384271
##                 MEDIAN_CV_COVERAGE MEDIAN_5PRIME_BIAS MEDIAN_3PRIME_BIAS
## OEP01_N706_S501           0.843857           0.061028           0.521079
## OEP01_N701_S501           0.914370           0.033350           0.373993
##                 CreER ERCC_reads
## OEP01_N706_S501     1      10516
## OEP01_N701_S501  3022       9331
clus.labels <- metadata[, "clusterLabels"]

In the original work (FP: add ref), cells were clustered into 13 clusters. Together with the label -2 for cells missing from that clustering, this gives 14 distinct labels.

col_clus <- c("transparent", brewer.pal(12, "Set3"), brewer.pal(8, "Set2"))
col_clus <- col_clus[1:length(unique(clus.labels))]
names(col_clus) <- sort(unique(clus.labels))
table(clus.labels)
## clus.labels
##  -2   1   2   3   4   5   7   8   9  10  11  12  14  15 
## 233  91  25  56  40  96  60  28  79  26  22  35  26  32

Batches are partially confounded with the biology: for example, 78 of the 91 cells in cluster 1 come from batches Y01 and Y04.

table(data.frame(batch = as.vector(batch),
                 cluster = clus.labels))
##         cluster
## batch    -2  1  2  3  4  5  7  8  9 10 11 12 14 15
##   GBC08A  5  0  2 12  9  0  0  0  0  0  2  0  2  9
##   GBC08B 14  0  7  5  3  0  0  0  1  2  4  0  5  6
##   GBC09A 13  0  1  5  9  0  0  0  1  1  0  0  6  7
##   GBC09B 21  0  2  2  7  0  0  0  3  0  0  0  3  0
##   P01     9  0  2  4  3 15  1  0  0  0  0  0  0  0
##   P02     6  2  0  9  3 15  3  3  2  3  0  2  1  0
##   P03A   36  3  0  3  0 12  2  9  4  2  0  2  0  0
##   P03B   19  1  2  1  1 11  1  2 10  1  1  2  0  0
##   P04    10  0  0  0  0 11  1  0  1  1  0  0  0  0
##   P05     3  0  0  0  1 11  3  0  1  0  2  2  0  0
##   P06    14  1  2  3  0  8  2  4  8  4  1  2  2  2
##   P10    15  3  1  4  0  4  5  9  2  0  2  5  0  1
##   P11    10  2  1  1  0  1  5  1 22  3  1  6  0  1
##   P12    11  0  2  0  0  4 10  0  8  2  3  6  4  1
##   P13    13  1  2  4  0  4 15  0  4  5  6  1  3  2
##   P14    10  0  0  1  2  0 12  0 12  2  0  7  0  3
##   Y01    14 47  1  1  2  0  0  0  0  0  0  0  0  0
##   Y04    10 31  0  1  0  0  0  0  0  0  0  0  0  0
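
One way to summarize such a table is to compute, for each cluster, the largest share of its cells that comes from a single batch. The sketch below uses toy labels standing in for `metadata$Batch` and `metadata$clusterLabels`; the summary statistic is an illustrative choice, not part of the original analysis.

```r
# Toy labels standing in for metadata$Batch and metadata$clusterLabels
batch_toy   <- c("Y01", "Y01", "Y01", "P01", "P01", "P02", "P02", "P02")
cluster_toy <- c(1, 1, 1, 5, 5, 5, 1, 7)

tab <- table(batch = batch_toy, cluster = cluster_toy)

# Column proportions: for each cluster, the share of its cells per batch
by_cluster <- prop.table(tab, margin = 2)
print(round(by_cluster, 2))

# Largest single-batch share per cluster; values near 1 flag confounding
max_share <- apply(by_cluster, 2, max)
print(max_share)
```

A cluster whose cells all come from one batch gets a share of 1, which is the situation where batch and biology cannot be separated.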

We have 849 cells and 1000 genes.

dim(core)
## [1] 1000  849
core[1:3, 1:3]
##        OEP01_N706_S501 OEP01_N701_S501 OEP01_N707_S507
## Cbr2              5799            3638            1448
## Cyp2f2            2158            2027            1078
## Gstm1             8763            7221            3581

Let’s create a SummarizedExperiment object to store the raw counts and information about the data, that is batches, original labels, and quality control measures.

se <- SummarizedExperiment(assays = list(rawCounts = core),
                           colData = metadata)
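
Before building such an object, it may be worth confirming that the columns of the count matrix and the rows of the metadata refer to the same samples in the same order. A minimal sketch with toy stand-ins for `core` and `metadata` (the toy names are assumptions):

```r
# Toy stand-ins for `core` (genes x cells) and `metadata` (one row per cell)
core_toy <- matrix(1:6, nrow = 2,
                   dimnames = list(c("g1", "g2"), c("s1", "s2", "s3")))
meta_toy <- data.frame(Batch = c("A", "B", "A"),
                       row.names = c("s3", "s1", "s2"))

# Reorder the metadata rows to follow the column order of the counts,
# failing loudly if any sample is missing from the metadata.
stopifnot(all(colnames(core_toy) %in% rownames(meta_toy)))
meta_toy <- meta_toy[colnames(core_toy), , drop = FALSE]
stopifnot(identical(rownames(meta_toy), colnames(core_toy)))
print(meta_toy)
```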

2. Dimensionality reduction adjusting for gene and cell-level covariates

To cluster cells and infer lineages, we first reduce the dimension of the data with zinbwave. Conceptually there are two fits. The first uses K = 0 to compute normalized values (i.e. deviance residuals) adjusted for batch; we could also adjust for gene length or GC content here. The second uses K = 50 to get the low-dimensional matrix W. In both fits, we correct for batch effects by including the batches in X. Eventually, we will call zinbwave just once, with an argument in zinbFit like “compute_normalized_values” taking values in c(TRUE, FALSE).

fn <- '../data/zinb_batch.rda'
if (runZinb && !file.exists(fn)){
  print(system.time(se <- zinbDimRed(se, K = 50, X = '~ Batch',
                                       residuals = TRUE,
                                       normalizedValues = TRUE)))
  save(se, file = fn)
}else{
  load(fn)
}
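
The run-or-load pattern used in this chunk recurs throughout the workflow and could be wrapped in a small helper. The sketch below is one possible shape, not part of the original code: `cached` is a hypothetical name, and it uses `saveRDS`/`readRDS` instead of `save`/`load` so that the result is returned as a value. It also stops with a clear message when the cache is missing and running is disabled, a case in which a bare `load()` would fail with a less informative error.

```r
# Hypothetical helper: evaluate `compute()` once and cache the result on disk.
cached <- function(path, compute, run = TRUE) {
  if (file.exists(path)) {
    return(readRDS(path))           # reuse the stored result
  }
  if (!run) {
    stop("no cached result at ", path, " and run = FALSE")
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Toy usage with a temporary file and a stand-in for the expensive fit
path <- tempfile(fileext = ".rds")
calls <- 0
fit_once <- function() { calls <<- calls + 1; sqrt(16) }
first  <- cached(path, fit_once)
second <- cached(path, fit_once)   # read from disk; fit_once is not called again
print(c(first = first, second = second, calls = calls))
```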

Normalized values

We use deviance residuals as normalized values for visualization. FP: explain the rationale: with K = 0, the residuals capture the biology after adjusting for batch. Let’s check that the deviance residuals look reasonable.

FP: note to myself: why do we have infinite values in the residuals now? It shows the same results as before, but we should not see infinite values here!

norm <- assays(se)$normalizedValues
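# Cap infinite residuals at the largest finite value
# (note that -Inf entries, if any, are also set to this maximum)
if (any(is.infinite(norm))){
  maxNorm <- max(norm[!is.infinite(norm)])
  assays(se)$normalizedValues[is.infinite(norm)] <- maxNorm
  norm <- assays(se)$normalizedValues
}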
norm[1:3,1:3]
##        OEP01_N706_S501 OEP01_N701_S501 OEP01_N707_S507
## Cbr2          4.533210        4.365844       -4.136863
## Cyp2f2        4.355783        4.321493        4.117663
## Gstm1         4.728341        4.624404        4.400054
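
The capping step replaces every infinite entry with the largest finite value, whatever its sign. If negative infinite residuals can occur, a sign-aware variant might be preferable. The following is a sketch of that alternative, with a hypothetical helper name; it is not what the chunk above does.

```r
# Hypothetical helper: clamp infinite entries to the finite range of x.
clamp_infinite <- function(x) {
  finite <- x[is.finite(x)]
  stopifnot(length(finite) > 0)
  x[is.infinite(x) & x > 0] <- max(finite)   # +Inf -> largest finite value
  x[is.infinite(x) & x < 0] <- min(finite)   # -Inf -> smallest finite value
  x
}

toy <- matrix(c(4.5, Inf, -4.1, -Inf, 4.3, 4.7), nrow = 2)
clamped <- clamp_infinite(toy)
print(clamped)
```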

Boxplot of the normalized values for each cell, ordered and colored by batch. The correction for batch seems adequate.

norm_order <- norm[, order(as.numeric(batch))]
col_order <- as.numeric(batch)[order(as.numeric(batch))]
boxplot(norm_order, main='Boxplot of normalized values\ncolor=batch',
        col = col_order, staplewex = 0, outline = 0, border = col_order,
        xaxt = 'n')

PCA of the normalized values, colored by batch on the left and by the previously found clusters on the right. We want to see no grouping by batch on the left and clear grouping by cluster on the right.

pca <- prcomp(t(norm))
par(mfrow = c(1,2))
plot(pca$x, col = col_batch[batch], pch = 20,
     main="PCA of normalized values\ncolor=batch")
plot(pca$x, col = col_clus[as.character(clus.labels)], pch = 20,
     main = "PCA of normalized values\ncolor=cluster")

par(mfrow = c(1,1))
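
The visual check can be complemented by a number. One simple option, shown here on simulated data, is the R² of a one-way ANOVA of each principal component on batch: values close to 0 suggest that the component does not separate batches. This is an illustrative diagnostic under stated assumptions (toy data, first three components), not part of the original workflow.

```r
# Toy data: 40 cells in 2 batches, with a strong batch shift on one gene
set.seed(42)
batch_toy <- factor(rep(c("A", "B"), each = 20))
expr <- matrix(rnorm(40 * 10), nrow = 40)                 # cells x genes
expr[, 1] <- expr[, 1] + ifelse(batch_toy == "B", 8, 0)   # batch effect

pca_toy <- prcomp(expr)

# R^2 of a one-way ANOVA of each of the first three PCs on batch
r2_batch <- sapply(1:3, function(j) {
  summary(lm(pca_toy$x[, j] ~ batch_toy))$r.squared
})
names(r2_batch) <- paste0("PC", 1:3)
print(round(r2_batch, 2))
```

In this simulation the first component is driven by the injected batch shift, so its R² is high; after a successful batch correction one would hope to see low values on all leading components.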

Dimensionality reduction

Let’s check that MDS on W gives a picture consistent with the original clusters.

W <- colData(se)[, grepl('^W', colnames(colData(se)))]
W <- as.matrix(W)
d <- dist(W)
fit <- cmdscale(d, eig = TRUE, k = 2)
plot(fit$points, col = col_clus[as.character(clus.labels)], main = 'MDS', pch = 20,
     xlab = 'Component 1', ylab = 'Component 2')
legend(x = 'bottomright', legend = unique(names(col_clus)), cex = .5,
       fill = unique(col_clus), title = 'Sample')
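
As a side note on the method: classical MDS (`cmdscale`) applied to Euclidean distances gives the same coordinates as PCA on the same matrix, up to the sign of each axis. The sketch below checks this on a random toy matrix standing in for W.

```r
set.seed(7)
W_toy <- matrix(rnorm(30 * 5), nrow = 30)   # 30 cells x 5 latent factors

mds <- cmdscale(dist(W_toy), k = 2)
pcs <- prcomp(W_toy)$x[, 1:2]

# Correlation of matching axes is +1 or -1 up to numerical error
axis_cor <- c(cor(mds[, 1], pcs[, 1]), cor(mds[, 2], pcs[, 2]))
print(round(abs(axis_cor), 6))
```

So the MDS plot above can be read as a plot of the first two principal components of W.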

3. Clustering of the cells

We use clusterExperiment with W.

EP: I updated it to work on an SE object so that it carries the metadata. If you already have a SummarizedExperiment object with W, you can use it as long as assay(seObj) gives W.

fn <- '../data/RSEC_W.rda'
if (runClus && !file.exists(fn)){
  # symbol for samples missing from original clustering
  metadata$clusterLabels[is.na(metadata$clusterLabels)] <- -2
  seObj <- SummarizedExperiment(t(W), colData = metadata)
  print(system.time(ceObj <- RSEC(seObj, k0s = 4:15, alphas = c(0.1),
                                  betas = 0.8,
                clusterFunction = "hierarchical01", minSizes=1,
                ncores = NCORES, isCount=FALSE,
                subsampleArgs = list(resamp.num=100,
                                     clusterFunction="kmeans",
                                     clusterArgs=list(nstart=10)),
                seqArgs = list(k.min=3, top.can=5), verbose=TRUE,
                combineProportion = 0.7,
                mergeMethod = "none")))
  save(ceObj, file = fn)
}else{
  load(fn)
}
plotClusters(ceObj, colPalette = c(bigPalette, rainbow(199)))

plotCoClustering(ceObj)
## Warning in .makeColors(clusters, colors = bigPalette): too many clusters to
## have unique color assignments

table(primaryClusterNamed(ceObj))
## 
##  -1  c1 c10  c2  c3  c4  c5  c6  c7  c8  c9 
## 329 125  36   9  89  94  13  96   5  48   5
sum(primaryCluster(ceObj) == -1)
## [1] 329
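
The cluster sizes printed above can be turned into two quick summaries: the fraction of cells left unassigned and the list of very small clusters. The sizes are copied from the output; the threshold of 10 cells for a "small" cluster is an arbitrary choice made for illustration.

```r
# Cluster sizes copied from the table above
sizes <- c("-1" = 329, c1 = 125, c10 = 36, c2 = 9, c3 = 89, c4 = 94,
           c5 = 13, c6 = 96, c7 = 5, c8 = 48, c9 = 5)

n_cells <- sum(sizes)
frac_unassigned <- sizes[["-1"]] / n_cells

# Clusters with fewer than 10 cells (arbitrary threshold), excluding "-1"
small <- names(sizes)[sizes < 10 & names(sizes) != "-1"]

print(n_cells)
print(round(frac_unassigned, 3))
print(small)
```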

FP: Elizabeth, we are working with the W here, does the locfdr make sense in this context? I set eval=FALSE in the next chunk to skip the merging step, let me know if you would rather keep using it. And if we want to still use the merging step, would we want to include it in RSEC function arguments instead of separately?

EP: I don’t think the merging step on the W makes a whole lot of sense – the method is irrelevant. The merging is based on calculating the % of genes found significant (the specific method is arbitrary). The best thing would be to replace the W with residuals in the assay of ceObj (or whatever data that you will do the DE on for the time stuff below), and then run the merging step on that data. I’m not particularly fond of locfdr. It was probably the method that gave the best merging to Russell and Diya. You’d really have to run mergeClusters setting plotInfo="all" and look at the results and decide both the cutoff level and the method.

EP: Also, if you don’t save the output of mergeClusters, it doesn’t update ceObj. I was calling it just for the resulting plots, since it was already merged in RSEC above. I’ve changed the code to update ceObj below.

FP: Ha ok, good to know. I’ll keep the eval=FALSE for the moment.

# redoes the merging simply to make the plot
#something like:
#assay(ceObj)
# if that replacement data should be considered on the transformed scale in plots, etc, the transformation function should be fixed as well:
#transformation(ceObj)
ceObj<-mergeClusters(ceObj, mergeMethod = "locfdr",
              plotInfo = "mergeMethod", cutoff = 0.01)

So, let’s look at a heatmap on normalized values.

FP: Elizabeth, I did not find how to define the column annotation track in the plot below to have the same colors as in ceObj@clusterLegend[[1]]. I tried to use arguments annColors and annCol from aheatmap as it is said in plotHeatmap documentation that for signature matrix arguments can be passed to aheatmap. But I got the error “The following arguments to aheatmap cannot be set by the user in plotHeatmap:Rowv,Colv,color,annCol,annColors”.

EP: Fanny, you would need to use the argument ‘clusterLegend’. That argument takes either the format of aheatmap (list with each element a named vector of colors) or the format of the clusterExperiment object (i.e. list with each element a matrix with columns for name and color). So I think the following code will run, though it might need the list to have names…

But an easier fix to the code would be to set the visualizeData option. I haven’t tested this because I don’t have the objects needed to run it, so let me know if there is an error.

FP: it seems great to me, what do you think?

EP: We should be careful, because the default in plotHeatmap is to plot the 500 most variable genes (maybe a slightly paternalistic default). I’ve changed it to all in the code here. I’ve also added the plotting of the batch, experiment, and Russell’s original clusters. We may not want to keep all of them, but probably at least Russell’s clusters for comparison.

# sampleData <- data.frame(ours = primaryCluster(ceObj))
# plotHeatmap(assays(se)$normalizedValues,
#             main = 'Normalized values, 1000 most variable genes',
#             clusterSamplesData = ceObj@dendro_samples,
#             sampleData = as.matrix(sampleData),clusterLegend=ceObj@clusterLegend[1])
# easier fix:
origClusterColors<-bigPalette[1:nlevels(colData(ceObj)$clusterLabels)]
experimentColors<-bigPalette[1:nlevels(colData(ceObj)$Experiment)]
batchColors<-bigPalette[1:nlevels(colData(ceObj)$Batch)]
metaColors<-list("Experiment"=experimentColors,"Batch"=batchColors,"clusterLabels"=origClusterColors)

plotHeatmap(ceObj, visualizeData = assays(se)$normalizedValues,
            whichClusters = "primary",clusterFeaturesData="all",
            clusterSamplesData = "dendrogramValue",
            sampleData=c("clusterLabels","Batch","Experiment"),
            clusterLegend=metaColors, annLegend=FALSE,
            main = 'Normalized values, 1000 most variable genes',
            breaks = 0.99)
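
Since the aheatmap format for legends is a list of named color vectors, it may help to build the colors with explicit names. The sketch below is an assumption about how that could be done: `named_colors` is a hypothetical helper and `palette_toy` stands in for `bigPalette`. It derives the distinct values with `factor()`, which also works for columns that are not factors (`nlevels()` returns 0 for those). Whether plotHeatmap requires the names has not been checked here.

```r
# Stand-in palette; in the chunk above this role is played by bigPalette
palette_toy <- c("#E31A1C", "#1F78B4", "#33A02C", "#6A3D9A", "#FF7F00")

# Hypothetical helper: one named colour per distinct value of x
named_colors <- function(x, palette) {
  lev <- levels(factor(x))
  stopifnot(length(lev) <= length(palette))
  setNames(palette[seq_along(lev)], lev)
}

batch_toy   <- c("P01", "P02", "P01", "Y01")
cluster_toy <- c(-2, 1, 5, 1)          # numeric labels work as well

meta_colors <- list(Batch = named_colors(batch_toy, palette_toy),
                    clusterLabels = named_colors(cluster_toy, palette_toy))
print(meta_colors)
```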

plot(fit$points, col = col_clus[as.character(clus.labels)],
     main = 'MDS W, color = original clusters', pch = 20,
     xlab = 'Component1', ylab = 'Component2')
legend(x = 'bottomright', legend = unique(names(col_clus)), cex = .5,
       fill = unique(col_clus), title = 'Sample')

palDF <- ceObj@clusterLegend[[1]]
pal <- palDF[, 'color']
names(pal) <- palDF[, 'name']
pal["-1"] = "transparent"
plot(fit$points, col = pal[primaryClusterNamed(ceObj)],
     main = 'MDS W, color = our new clusters', pch = 20,
     xlab = 'Component1', ylab = 'Component2')
legend(x = 'bottomright', legend = names(pal), cex = .5,
       fill = pal, title = 'Sample')

4. Pseudotime ordering

The goal of this section is to see if we need to refit zinbwave when we want to run slingshot. We first run slingshot on the W used by clusterExperiment. In the second part of this section, we fit zinbwave on the matrix of counts where the unassigned cells have been removed. For each part (without or with refitting zinbwave), we run slingshot in the supervised and unsupervised mode and try k=3, k=4, k=5 dimensions in W.

From what I understand, the starting original clusters are 1 and 5 (HBC), and the end original clusters are 15 (Microvillus), 9 and 12 (Neuron), and 4 and 7 (Sus). Additionally, we want the GBC cluster to be a junction before the differentiation between Microvillus and Neuron. The correspondence with the original clusters is as follows:

table(data.frame(original = clus.labels, ours = primaryClusterNamed(ceObj)))
##         ours
## original  -1  c1 c10  c2  c3  c4  c5  c6  c7  c8  c9
##       -2 126  36   6   4  25   7  12   5   5   5   2
##       1   47  43   0   1   0   0   0   0   0   0   0
##       2    2   1   0   0   0  22   0   0   0   0   0
##       3    2   3   0   0   1  49   1   0   0   0   0
##       4   16   2   0   0  22   0   0   0   0   0   0
##       5   53  38   0   4   1   0   0   0   0   0   0
##       7   21   0   0   0  39   0   0   0   0   0   0
##       8   27   1   0   0   0   0   0   0   0   0   0
##       9   15   0   0   0   1   0   0  57   0   3   3
##       10   2   0   0   0   0   0   0   0   0  24   0
##       11   6   1   0   0   0  15   0   0   0   0   0
##       12   1   0   0   0   0   0   0  34   0   0   0
##       14   9   0   0   0   0   1   0   0   0  16   0
##       15   2   0  30   0   0   0   0   0   0   0   0
Cluster name   Description                     Correspondence
c1             HBC                             original 1, 5
c2             new and small                   new and small
c3             new and small                   new and small
c4             GBC / immature neurons / MV 1   original 2, 3, 11, 14
c5             Sus                             original 4, 7
c6             Neuron                          original 9, 12
c7             Immature Neuron                 original 10, 14
c8             Immature Neuron                 original 14
c9             Microvillus                     original 15
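
A correspondence of this kind can also be derived from the cross-tabulation itself, by looking up, for each original cluster, the new cluster that holds most of its assigned cells. The sketch below does this on toy labels; the majority rule and the exclusion of unassigned cells are assumptions made for illustration.

```r
# Toy cross-tabulation: rows = original labels, columns = new labels
original <- c(1, 1, 5, 5, 5, 15, 15, 9, 9, 9)
ours     <- c("c1", "c1", "c1", "c1", "-1", "c10", "c10", "c6", "c6", "-1")
tab <- table(original = original, ours = ours)

# Drop the unassigned column, then, for every original cluster,
# report the new cluster that holds most of its assigned cells
assigned <- tab[, colnames(tab) != "-1", drop = FALSE]
best <- colnames(assigned)[apply(assigned, 1, which.max)]
names(best) <- rownames(assigned)
print(best)
```

`which.max` returns the first maximum, so ties are broken by column order; for a real table, the counts themselves should be inspected alongside the mapping.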
Kvec <- c(3, 4, 5)

Use previous W

The input to slingshot is the W used for clusterExperiment, restricted to its first k dimensions, with k in (3, 4, 5).

Unsupervised

K = 3 does not seem very good to me: Sus is not an end cluster, GBC is an end cluster.

K = 4 is better, slingshot finds the end clusters but there is a spurious end cluster.

K = 5 does not seem great to me: GBC is an end cluster and Sus and Microvillus are in the same lineage.

our_cl <- primaryClusterNamed(ceObj)
cl = our_cl[our_cl != "-1"]
pal = pal[names(pal) != '-1']
for (k in Kvec){
  X <- W[our_cl != "-1", 1:k]

  lineages <- get_lineages(X, clus.labels = cl, start.clus = "c1")
  curves <- get_curves(X, clus.labels = cl, lineages = lineages)
  plot_curves(X, cl, curves, col.clus = pal)
  plot_tree(X, cl, lineages, col.clus = pal)

  print(paste0("K=", k))
  print(lineages$lineage1)
  print(lineages$lineage2)
  print(lineages$lineage3)
  print(lineages$lineage4)
  print(lineages$lineage5)
}

## [1] "K=3"
## [1] "c1"  "c2"  "c7"  "c3"  "c10" "c4"  "c8"  "c6" 
## [1] "c1"  "c2"  "c7"  "c3"  "c10" "c4"  "c8"  "c9" 
## [1] "c1" "c5"
## NULL
## NULL

## [1] "K=4"
## [1] "c1" "c5" "c4" "c8" "c6"
## [1] "c1" "c5" "c4" "c8" "c9"
## [1] "c1" "c2" "c7" "c3"
## [1] "c1"  "c5"  "c4"  "c10"
## NULL

## [1] "K=5"
## [1] "c1" "c5" "c4" "c8" "c9" "c6"
## [1] "c1"  "c2"  "c7"  "c3"  "c10"
## NULL
## NULL
## NULL
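
Printing a fixed number of lineages produces `NULL` lines whenever fewer lineages are found. A sketch of an alternative is shown below; it assumes only what the chunk above suggests, namely that the result behaves like a list with elements named `lineage1`, `lineage2`, and so on. The toy list is a stand-in, not actual `get_lineages` output.

```r
# Toy stand-in for the list returned by get_lineages(): lineage entries
# plus another component that should not be printed
lineages_toy <- list(lineage1 = c("c1", "c5", "c4", "c8", "c6"),
                     lineage2 = c("c1", "c2", "c7", "c3"),
                     other = "not a lineage")

# Select every element whose name is "lineage" followed by digits
is_lineage <- grepl("^lineage[0-9]+$", names(lineages_toy))
paths <- lineages_toy[is_lineage]

# End cluster of each lineage: the last element of each path
end_clusters <- vapply(paths, function(p) p[length(p)], character(1))
for (nm in names(paths)) {
  cat(nm, ":", paste(paths[[nm]], collapse = " -> "), "\n")
}
print(end_clusters)
```

Collecting the end clusters in one vector makes it easier to compare runs with different K against the expected end clusters.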

Supervised

K = 3 finds GBC as an end cluster (that I did not specify in the end.clus!).

K = 4 Yeah! This seems to be what we want, even though we still have a spurious end cluster and GBC is not really at the junction.

K = 5 Yeah! Even though GBC is not really at the junction.

for (k in Kvec){
  X <- W[our_cl != "-1", 1:k]

  lineages <- get_lineages(X, clus.labels = cl, start.clus = "c1",
                           end.clus = c("c3", "c6", "c10"))
  curves <- get_curves(X, clus.labels = cl, lineages = lineages)
  plot_curves(X, cl, curves, col.clus = pal)
  plot_tree(X, cl, lineages, col.clus = pal)

  print(paste0("K=", k))
  print(lineages$lineage1)
  print(lineages$lineage2)
  print(lineages$lineage3)
  print(lineages$lineage4)
  print(lineages$lineage5)
}

## [1] "K=3"
## [1] "c1" "c5" "c4" "c8" "c6"
## [1] "c1" "c5" "c4" "c8" "c9"
## [1] "c1" "c2" "c7" "c3"
## [1] "c1"  "c5"  "c4"  "c10"
## NULL

## [1] "K=4"
## [1] "c1" "c5" "c4" "c8" "c6"
## [1] "c1" "c5" "c4" "c8" "c9"
## [1] "c1" "c2" "c7" "c3"
## [1] "c1"  "c5"  "c4"  "c10"
## NULL

## [1] "K=5"
## [1] "c1" "c5" "c4" "c8" "c9" "c6"
## [1] "c1" "c2" "c7" "c3"
## [1] "c1"  "c5"  "c4"  "c10"
## NULL
## NULL

Re-fitting zinbwave

Unsupervised

K = 3 is better than when we did not refit zinbwave but still not perfect: Sus in all the clusters. GBC not really at the junction.

K = 4 good even if GBC not really at the junction.

K = 5 not great GBC is an end cluster.

fn <- '../data/refit_zinbwave_slingshot.rda'
if (runZinb && !file.exists(fn)){
  zinbList <- lapply(Kvec, function(k){
    zinbFit(se[, our_cl != "-1"], X = '~ Batch', K = k)
  })
  save(zinbList, file = fn)
}else{
  load(fn)
}
for(k in Kvec) {
  X <- getW(zinbList[[match(k, Kvec)]])[, 1:k]  # fit for this k, independent of the values in Kvec

  lineages <- get_lineages(X, clus.labels = cl, start.clus = "c1")
  curves <- get_curves(X, clus.labels = cl, lineages = lineages)
  plot_curves(X, cl, curves, col.clus = pal)
  plot_tree(X, cl, lineages, col.clus = pal)

  print(paste0("K=", k))
  print(lineages$lineage1)
  print(lineages$lineage2)
  print(lineages$lineage3)
  print(lineages$lineage4)
  print(lineages$lineage5)
}

## [1] "K=3"
## [1] "c1" "c5" "c4" "c8" "c9" "c6"
## [1] "c1" "c2" "c3" "c7"
## [1] "c1"  "c5"  "c4"  "c10"
## NULL
## NULL

## [1] "K=4"
## [1] "c1" "c5" "c4" "c8" "c9" "c6"
## [1] "c1"  "c5"  "c4"  "c10"
## [1] "c1" "c2"
## [1] "c1" "c3"
## [1] "c1" "c7"

## [1] "K=5"
## [1] "c1" "c5" "c4" "c8" "c9" "c6"
## [1] "c1" "c2" "c3"
## [1] "c1"  "c7"  "c10"
## NULL
## NULL
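
Looking up a fit by its position in the list ties the loop to the exact contents of `Kvec`. Naming the list elements by K is one way to avoid that. The sketch below uses a stand-in function instead of the real fit, so `fake_fit` and the element names `K3`, `K4`, `K5` are assumptions made for illustration.

```r
Kvec <- c(3, 4, 5)

# Stand-in for the expensive fit; returns a cells x k matrix of zeros
fake_fit <- function(k, n_cells = 6) matrix(0, nrow = n_cells, ncol = k)

fits <- lapply(Kvec, fake_fit)
names(fits) <- paste0("K", Kvec)      # "K3", "K4", "K5"

for (k in Kvec) {
  X <- fits[[paste0("K", k)]]         # lookup by name, not by position
  stopifnot(ncol(X) == k)
}
print(sapply(fits, ncol))
```

With named elements, adding or removing a value in `Kvec` does not silently shift which fit is used for which K.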

Supervised

K = 3 Yeah! close to perfection.

K = 4 good even if GBC not really at the junction.

K = 5 no, GBC is an end cluster.

for(k in Kvec){
  X <- getW(zinbList[[match(k, Kvec)]])[, 1:k]  # fit for this k, independent of the values in Kvec

  lineages <- get_lineages(X, clus.labels = cl, start.clus = "c1",
                           end.clus = c("c3", "c6", "c10"))
  curves <- get_curves(X, clus.labels = cl, lineages = lineages)
  plot_curves(X, cl, curves, col.clus = pal)
  plot_tree(X, cl, lineages, col.clus = pal)

  print(paste0("K=", k))
  print(lineages$lineage1)
  print(lineages$lineage2)
  print(lineages$lineage3)
  print(lineages$lineage4)
  print(lineages$lineage5)
}

## [1] "K=3"
## [1] "c1" "c5" "c4" "c8" "c9" "c6"
## [1] "c1"  "c5"  "c4"  "c10"
## [1] "c1" "c2" "c3"
## [1] "c1" "c7"
## NULL

## [1] "K=4"
## [1] "c1" "c5" "c4" "c8" "c9" "c6"
## [1] "c1"  "c5"  "c4"  "c10"
## [1] "c1" "c2"
## [1] "c1" "c3"
## [1] "c1" "c7"

## [1] "K=5"
## [1] "c1" "c5" "c4" "c8" "c9" "c6"
## [1] "c1" "c2" "c3"
## [1] "c1"  "c7"  "c10"
## NULL
## NULL

CONCLUSION: K = 5 is never great, as GBC is generally an end cluster. K = 4 is ok for all the methods and a bit better when zinbwave is refitted. K = 3 is good when refitting in supervised mode.

It seems to me that using slingshot on W without refitting zinbwave, with K = 4, gives good results, and that supervised mode is slightly better than unsupervised mode. This is a single example and we should obviously not draw a general conclusion from it, but I think that for the purpose of the workflow it is fine to use slingshot without refitting zinbwave. We should add a note for the user that refitting zinbwave is preferable, as it gives more power.

5. DE analysis

Here is the kind of plot we want to present.

de <- read.csv('../data/oe_markers.txt', stringsAsFactors = FALSE, header = FALSE)
de <- de$V1
plotHeatmap(ceObj, 
            visualizeData = assays(se)$normalizedValues[rownames(se) %in% de, ],
            clusterSamplesData = "dendrogramValue",
            whichClusters = "primary",
            main = 'Normalized values, marker genes',
            breaks = 0.99)
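
Because the data were filtered to the most variable genes, some markers may be absent from the matrix, and subsetting with `%in%` drops them silently. A quick check is sketched below with toy stand-ins for `rownames(se)` and the marker list `de`; the gene names in the toy marker list are placeholders.

```r
# Toy stand-ins for rownames(se) and the marker list `de`
genes   <- c("Cbr2", "Cyp2f2", "Gstm1", "Krt5")
markers <- c("Krt5", "Cbr2", "Omp")

found   <- intersect(markers, genes)   # markers present in the data
missing <- setdiff(markers, genes)     # markers filtered out upstream
print(found)
print(missing)
```

Reporting the missing markers makes it clear how many rows the heatmap can actually show.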

FP: Kelly, did you do the DE analysis for Russell’s paper? If yes, what tool did you use? On what data? The full-quantile normalized counts? Do you have code available?

Session Info

sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] rARPACK_0.11-0               digest_0.6.12               
##  [3] RColorBrewer_1.1-2           Rtsne_0.13                  
##  [5] magrittr_1.5                 gplots_3.0.1                
##  [7] ggplot2_2.2.1                doParallel_1.0.10           
##  [9] iterators_1.0.8              foreach_1.4.3               
## [11] slingshot_0.0.3-5            princurve_1.1-12            
## [13] zinbwave_0.99.4.2            BiocParallel_1.10.1         
## [15] clusterExperiment_1.3.0-9009 scone_1.0.0                 
## [17] scRNAseq_1.2.0               SummarizedExperiment_1.6.3  
## [19] DelayedArray_0.2.4           matrixStats_0.52.2          
## [21] Biobase_2.36.2               GenomicRanges_1.28.3        
## [23] GenomeInfoDb_1.12.1          IRanges_2.10.2              
## [25] S4Vectors_0.14.2             BiocGenerics_0.22.0         
## 
## loaded via a namespace (and not attached):
##   [1] shinydashboard_0.6.0     R.utils_2.5.0           
##   [3] RSQLite_1.1-2            AnnotationDbi_1.38.0    
##   [5] htmlwidgets_0.8          grid_3.4.0              
##   [7] trimcluster_0.1-2        RNeXML_2.0.7            
##   [9] DESeq_1.28.0             munsell_0.4.3           
##  [11] codetools_0.2-15         statmod_1.4.29          
##  [13] scran_1.4.4              DT_0.2                  
##  [15] miniUI_0.1.1             colorspace_1.3-2        
##  [17] energy_1.7-0             knitr_1.16              
##  [19] uuid_0.1-2               pspline_1.0-17          
##  [21] robustbase_0.92-7        bayesm_3.0-2            
##  [23] NMF_0.20.6               tximport_1.4.0          
##  [25] GenomeInfoDbData_0.99.0  hwriter_1.3.2           
##  [27] rhdf5_2.20.0             rprojroot_1.2           
##  [29] EDASeq_2.10.0            diptest_0.75-7          
##  [31] R6_2.2.1                 ggbeeswarm_0.5.3        
##  [33] taxize_0.8.4             locfit_1.5-9.1          
##  [35] flexmix_2.3-14           bitops_1.0-6            
##  [37] reshape_0.8.6            assertthat_0.2.0        
##  [39] scales_0.4.1             nnet_7.3-12             
##  [41] beeswarm_0.2.3           gtable_0.2.0            
##  [43] phylobase_0.8.4          RUVSeq_1.10.0           
##  [45] bold_0.4.0               rlang_0.1.1             
##  [47] genefilter_1.58.1        splines_3.4.0           
##  [49] rtracklayer_1.36.3       lazyeval_0.2.0          
##  [51] hexbin_1.27.1            rgl_0.98.1              
##  [53] yaml_2.1.14              reshape2_1.4.2          
##  [55] abind_1.4-5              GenomicFeatures_1.28.1  
##  [57] backports_1.1.0          httpuv_1.3.3            
##  [59] tensorA_0.36             tools_3.4.0             
##  [61] gridBase_0.4-7           stabledist_0.7-1        
##  [63] dynamicTreeCut_1.63-1    Rcpp_0.12.11            
##  [65] plyr_1.8.4               visNetwork_1.0.3        
##  [67] progress_1.1.2           zlibbioc_1.22.0         
##  [69] purrr_0.2.2.2            RCurl_1.95-4.8          
##  [71] prettyunits_1.0.2        viridis_0.4.0           
##  [73] zoo_1.8-0                cluster_2.0.6           
##  [75] data.table_1.10.4        RSpectra_0.12-0         
##  [77] mvtnorm_1.0-6            whisker_0.3-2           
##  [79] gsl_1.9-10.3             aroma.light_3.6.0       
##  [81] mime_0.5                 evaluate_0.10           
##  [83] xtable_1.8-2             XML_3.98-1.7            
##  [85] mclust_5.3               gridExtra_2.2.1         
##  [87] compiler_3.4.0           biomaRt_2.32.0          
##  [89] scater_1.4.0             tibble_1.3.3            
##  [91] KernSmooth_2.23-15       R.oo_1.21.0             
##  [93] htmltools_0.3.6          pcaPP_1.9-61            
##  [95] segmented_0.5-2.0        tidyr_0.6.3             
##  [97] geneplotter_1.54.0       howmany_0.3-1           
##  [99] DBI_0.6-1                MASS_7.3-47             
## [101] fpc_2.1-10               MAST_1.2.1              
## [103] boot_1.3-19              compositions_1.40-1     
## [105] ShortRead_1.34.0         Matrix_1.2-10           
## [107] ade4_1.7-6               R.methodsS3_1.7.1       
## [109] gdata_2.17.0             igraph_1.0.1            
## [111] rncl_0.8.2               GenomicAlignments_1.12.1
## [113] registry_0.3             numDeriv_2016.8-1       
## [115] locfdr_1.1-8             plotly_4.7.0            
## [117] xml2_1.1.1               annotate_1.54.0         
## [119] vipor_0.4.5              rngtools_1.2.4          
## [121] pkgmaker_0.22            XVector_0.16.0          
## [123] stringr_1.2.0            copula_0.999-16         
## [125] ADGofTest_0.3            softImpute_1.4          
## [127] Biostrings_2.44.0        rmarkdown_1.5           
## [129] dendextend_1.5.2         edgeR_3.18.1            
## [131] kernlab_0.9-25           shiny_1.0.3             
## [133] Rsamtools_1.28.0         gtools_3.5.0            
## [135] modeltools_0.2-21        rjson_0.2.15            
## [137] nlme_3.1-131             jsonlite_1.4            
## [139] viridisLite_0.2.0        limma_3.32.2            
## [141] lattice_0.20-35          httr_1.2.1              
## [143] DEoptimR_1.0-8           survival_2.41-3         
## [145] FNN_1.1                  prabclus_2.2-6          
## [147] glmnet_2.0-10            class_7.3-14            
## [149] stringi_1.1.5            mixtools_1.1.0          
## [151] latticeExtra_0.6-28      caTools_1.17.1          
## [153] memoise_1.1.0            dplyr_0.5.0             
## [155] ape_4.1